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Neural populations encode information about their stimulus in a collective fashion, by joint activ- 
ity patterns of spiking and silence. A full account of this mapping from stimulus to neural activity 
is given by the conditional probability distribution over neural codewords given the sensory input. 
To be able to infer a model for this distribution from large-scale neural recordings, we introduce a 
stimulus-dependent maximum entropy (SDME) model — a minimal extension of the canonical linear- 
nonlinear model of a single neuron, to a pairwise-coupled neural population. The model is able to 
capture the single-cell response properties as well as the correlations in neural spiking due to shared 
stimulus and due to effective neuron-to-neuron connections. Here we show that in a population 
of 100 retinal ganglion cells in the salamander retina responding to temporal white- noise stimuli, 
dependencies between cells play an important encoding role. As a result, the SDME model gives 
a more accurate account of single cell responses and in particular outperforms uncoupled models 
in reproducing the distributions of codewords emitted in response to a stimulus. We show how the 
SDME model, in conjunction with static maximum entropy models of population vocabulary, can 
be used to estimate information-theoretic quantities like surprise and information transmission in a 
neural population. 



INTRODUCTION 

Neurons represent and transmit information using 
temporal sequences of short stereotyped bursts of electri- 
cal activity, or spikes [1 j. Much of what we know about 
this encoding has been learned by studying the mapping 
between stimuli and responses at the level of single neu- 
rons, and building detailed models of what stimulus fea- 
tures drive a single neuron to spike [2]-[4]. In most of the 
nervous system, however, information is represented by 
joint activity patterns of spiking and silence over pop- 
ulations of cells. In a sensory context, these patterns 
can be thought of as codewords that convey informa- 
tion about external stimuli to the central nervous system. 
One of the challenges of neuroscience is to understand the 
neural codebook — a map from the stimuli to the neural 
codewords — a task made difficult by the fact that neu- 
rons respond to the stimulus neither deterministically nor 
independently. 

The structure of correlations among the neurons deter- 
mines the organization of the code, that is, how different 
stimuli are represented by the population activity [SHE]. 
These correlations also determine what the brain, having 
no access to the stimulus apart from the spikes coming 
from the sensory periphery, can learn about the outside 
world [SHU]. The source of these correlations, which 
arise either from the correlated external stimuli to the 
neurons, from "shared" local input from other neurons, 



* gtkacik@ist.ac.at 
=,= equal contributions 



or from "private" independent noise, have been heavily 
debated fT2UT5] . In many neural systems, the correla- 
tion between pairs of (even nearby or functionally simi- 
lar) neurons was found to be weak p~6j [T71 [26] . Similarly, 
the redundancy between pairs in terms of the information 
they convey about their stimuli was also typically weak 
[T8H20] . The low correlations and redundancies between 
pairs of neurons therefore led to the suggestion that neu- 
rons in larger populations might encode information in- 
dependently [2T] . which was echoed by theoretical ideas 
of maximally efficient neural codes [22H24] . 

Recent studies of the neural code in large populations 
have, however, revealed that while the typical pairwise 
correlations may be weak, larger populations of neurons 
can nevertheless be strongly correlated as a whole [25]- 
[33] . Maximum entropy models of neural populations 
have shown that such strong network correlations can be 
the result of collective effects of pairwise dependencies 
between cells, and, in some cases, of sparse high-order 
dependencies [26j [34] [35]. Most of these studies have 
characterized the strength of network effects and spik- 
ing synchrony at the level of the total vocabulary of the 
population, i.e. the distribution of codewords averaged 
over all the stimuli. It is not immediately clear how these 
findings affect stimulus encoding, where one needs to dis- 
tinguish the impact of correlated stimuli that the cells 
receive ("stimulus correlations"), from the impact of co- 
variance of the cells conditional on the stimulus ("noise 
correlations"). For small populations of neurons, it has 
been shown that taking into account correlations for de- 
coding or reconstructing the stimulus can be beneficial 
compared to the case where correlations are neglected 
(e.g. [35-39 ). Similarly, generalized linear models high- 



2 



lighted the importance of dependencies between cells in 
accounting for correlations between pairs and triplets of 
retinal ganglion cell responses [40] . 

Here we present a new encoding model that allows us 
to study in fine detail the codebook of large neural popu- 
lations. Importantly, this model gives a joint probability 
distribution over the activity patterns of the whole pop- 
ulation for a given stimulus, while capturing both the 
stimulus and noise correlations. This new model belongs 
to a class of maximum entropy models with strong links 
to statistical physics [27, 41-52 and is directly related 
to maximum entropy models of neural vocabulary [26]- 
[32] , allowing us estimate the entropy and its derivative 
quantities for the neural code. In sum, the maximum en- 
tropy framework enables us to progress towards our goal 
of focusing attention on the level of joint patterns of ac- 
tivity, rather than capturing low- level statistics (e.g., the 
individual firing rates) of the neural code alone. 

We start by showing that linear-nonlinear (LN) mod- 
els of retinal ganglion cells responding to spatially un- 
structured stimuli capture a significant part of the single 
neuron response, but still miss much of the detail; in 
particular, we show that they fail to capture the correla- 
tion structure of firing among the cells. We next present 
our new stimulus- dependent maximum entropy (SDME) 
model, which is a hybrid between linear-nonlinear mod- 
els for single cells and the pairwise maximum entropy 
models. Applied to groups of ~ 100 neurons recorded si- 
multaneously, we find that SDME models outperform the 
LN models for the stimulus-response mapping of single 
cells and, crucially, give a significantly better account of 
the distribution of codewords in the neural population. 

RESULTS 

We recorded the simultaneous spiking activity of ~ 110 
ganglion cells from the salamander retina [53] , presented 
with repeats of a 10 s long full- field flicker ("Gaussian 
FFF") movie, where the light intensity on the screen 
was sampled independently from a Gaussian distribution 
with a frequency of 30 Hz (Fig. [T^i). This "frozen noise" 
stimulus was repeated 726 times, for a total of ~ 2 h of 
stimulation. Most of the recorded cells exhibited tempo- 
ral OFF-like behaviors (Fig. [TJd). We chose for further 
analysis N = 100 cells that were reliably sorted, demon- 
strated a robust and stable response over repeats, and 
generated at least 2500 spikes during the course of the 
experiment. 

We discretized neural responses into At = 10 ms bins, 
and represented the activity of the neurons in response 
to the stimulus as binary words in each of the time 
bins. If neuron i = 1, ... ,7V was active in time bin £, 
we denoted a spike (or more spikes) as od (t) = 1, and 
Xi(t) = if it was silent. In this representation, the 
whole experiment yielded a total of about T ~ 7.3 • 10 5 
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FIG. 1: Response of a large population of ganglion 
cells to a 10 s long repeated visual stimulus, (a) White 
noise uncorrelated Gaussian stimulus presented at 30 Hz and 
the spiking patterns of 3 cells to repeated presentations of the 
stimulus, (b) Spike-trigerred averages of 110 simultaneously 
recorded cells; a subset of 100 cells was chosen for further 
analysis, (c) The histogram of pairwise correlations between 
cells for repeated Gaussian white noise stimulus (green), re- 
peated natural pixel movie (red), and non-repeated natural 
pixel movie (blue) |35j . (d) Average pairwise correlation co- 
efficient between cells as a function of the distance (mean and 
std are across pairs of cells at a given distance). 



binary word samples. Using repeated presentations of the 
same movie, we estimated the average response of each 
of the cells across repeats, ri(t) = (xi(t)) rep , or the peri- 
stimulus time histogram (PSTH). Following Refs. [4|l54]. 
we fitted a linear-nonlinear model for each of the cells in 
the experiment, such that the predicted rate rf N (t) = 
J\fi(ki • s(t)), where Iq is a linear filter matched for the 
i-th cell, Mi is its point-wise nonlinear function, and s(t) 
is the stimulus fragment from time t — r until t (here 
we used r = 400 ms, making s(t) a vector of light in- 
tensities with 40 components). Linear filters were recon- 
structed using reverse correlation (spike-triggered aver- 
age), and nonlinearities were obtained by histograming 
P(lq • s(t)|spike) into K = 20 adaptively-sized bins and 
obtaining rf N (t) = M(k; • s) = P(spike|k i • s(t)) by in- 
verting P(ki • s(t)| spike) using Bayes' rule. These LN 
models captured most of structure of the PSTH, yet as 
the example cell in Fig. [2^i shows, they often misesti- 
mated the exact firing rates of the neuron, or sometimes 
even missed parts of the neural response altogether. In 
the Gaussian FFF condition, the normalized (Pearson) 
correlation between the measured and predicted PSTH, 
Corr(r^(t), r^ N (t)), was 0.69 ± 0.06 (mean ± std across 
100 cells). 

The performance gap of the canonical LN models in 
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predicting single neuron responses suggests that either 
the single-neuron models need to be improved to account 
for the observed behavior, or that interactions between 
neurons play an important encoding role and need to be 
included. While firing rate prediction performance can 
be improved for single neurons by models with higher- 
dimensional stimulus sensitivity (e.g. [54, 55 j) or dynam- 
ical aspects of spiking behavior (e.g. [56l [57]), previous 
work, as well as the results below, demonstrated that 
even conditionally-independent models which by con- 
struction perfectly reproduce the firing rate behavior of 
single cells, often fail to capture the measured correla- 
tion structure of firing between pairs of cells, as well as 
higher-order statistical structure [26] . 

We find two salient features of the correlations between 
pairs of neurons: (i) the pairwise correlations between 
cells in response to the Gaussian FFF movie are typi- 
cally weak but are not zero (Fig. consistently with 
previous reports [26 , 2TJE2]); (ii) the correlation in neu- 
ral activities shows a fast decay with distance despite the 
infinite correlation length of the stimulus, but the decay 
does not reach zero correlation even at relatively large 
distances (Fig. [l]i). This salient structure, along with 
any other potential statistical correlation at the pairwise 
order, is characterized by the covariance matrix of ac- 
tivities, Cij = (xiXj) — (xi)(xj), where the averages are 
taken across time and repeats. 

We would like to find a model of the neural code that 
will be able to reproduce these pairwise statistics. Mo- 
tivated by the shortcomings of the canonical LN model, 
we therefore asked whether a model that combined the 
LN (receptive-field based) aspect of single cells with the 
interactions between cells, could give a better account 
of the neural stimulus-response mapping. Importantly, 
the new model should capture not only the firing rate 
of single cells and the full pairwise correlation structure 
between them, but should also accurately predict the full 
distribution of the joint activity patterns across the whole 
population. Because the joint distributions of activity are 
high-dimensional (e.g., the distribution over codewords 
across the duration of the experiment, P({a^}), has 2^ 
components), this is a very demanding benchmark for 
any model. 

Here we propose the simplest extension to the 
conditionally-independent set of LN models for each cell 
in the recorded population, by including pairwise cou- 
plings between cells, so that the spiking of cell i can 
increase or decrease the probability of spiking for cell 
j [58] [59] . In contrast to previous proposals, this cou- 
pling will be introduced so that the resulting model is a 
maximum-entropy model for P({x^}|s), the conditional 
distribution over population activity patterns given the 
stimulus. We recall that the maximum entropy models 
give the most parsimonious probabilistic description of 
the joint activity patterns, which perfectly reproduces 
a chosen set of measured statistics over these patterns, 



without making any additional assumptions [60] . 

We start by introducing the least structured (maxi- 
mum entropy) distribution P(x±, x^ • • . , X]\r\i) that re- 
produces exactly the observed average firing rate for each 
time bin t and for each neuron z, ri(t) = (xi(t))data — 
( x i(t))p, as well as the overall correlation matrix dj be- 
tween all pairs of cells (c.f. [61 J. Thus, we seek P({xi}\t) 
that maximizes £: 

£[P({xi}\t)] = ~ E P({xi}\t)]og 2 P({xi}\t) 

+ y^ai(t)[(Xi(t)) P - (Xi(i))data\ 
i,t 



]T Kt)[P({xi}\t) - 1], 

{ Xi },t 



(1) 



where the subscript to brackets (•) denotes whether the 
averaging is done over the maximum entropy distribu- 
tion (P), or over the recorded data; Lagrange multipliers 
H ensure that the distributions are normalized. This is 
an optimization problem for parameters oti(t) and /^-, 
which has a unique solution since the entropy is convex. 
The functional form of the solution to this optimization 
problem is well-known; in our case it can be written as 
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where the individual time-dependent parameters for 
each of the cells, o^(t), and the stimulus-independent 
pairwise interaction terms fy, are set to match the 
measured firing rates n(i) and the pairwise correla- 
tions Cij] Z(t) is a normalization factor or parti- 
tion function for each time bin t, given by Z(t) = 

E{^} exp (£; OLi(t)xi + \ J2ij PijXiXj) . 

The time-dependent maximum entropy (TDME) 
model in Eq. ^ is equivalent to an Ising model 
from physics, where the single-cell parameters are time- 
dependent local fields acting on each of the neurons 
(spins), and static (stimulus-independent) infinite-range 
interaction terms couple each pair of spins. In the 
limit where interactions go to zero, fyj = 0, the model 
in Eq. ^ becomes the full conditionally-independent 
model, itself a maximum entropy model that reproduces 
exactly the firing rate of every neuron, ri(t)\ in this case 
the probability distribution factorizes, and the solution 
for ai{t) and Z(t) becomes trivially computable from the 
firing rates, ri{t). Time-dependent maximum entropy 
models are powerful, since they make no assumptions 
about how the stimulus drives the response; they can 
serve as useful benchmarks for other models (especially 
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the conditionally independent model with = 0). On 
the other hand, these models require repeated stimulus 
presentations to fit, involve a number of parameters that 
grows linearly with the duration of the stimulus, do not 
generalize to new stimuli, and do not provide an explicit 
map from the stimuli to the responses. 

To make a direct link to the stimulus and allow com- 
parison with a set of uncoupled LN models, we take 
the general time-dependent maximum entropy model of 
Eq. ([2| and specialize it to a particular form of stimu- 
lus dependence. Rather than having an arbitrary time- 
dependent parameter for every neuron for each time 
bin, ai(i), we assume that this dependence takes place 
through the stimulus projection alone, i.e. cti(t) = 
ai(ki • s(t)), much like in an LN model, where the neu- 
ral firing depends on the value of the stimulus projec- 
tion onto the linear filter k^. This choice is made purely 
for the sake of convenience: the model could be gener- 
alized to, e.g., neurons that depend on two linear pro- 
jections of the stimulus, by making depend jointly on 
(ki • s(t),k2 • s(t)), although such models would be pro- 
gressively more difficult to infer from data. 

Concretely, we estimated the linear filter Iq for each 
cell i using reverse correlation, and convolved the filter 
with the stimulus sequence, s(t), to get the "generator 
signal" gi(t) — Iq • s(t). We then looked for the max- 
imum entropy probability distribution P({a^}|s(t)), by 
requiring that the average firing rate of every cell given 
the generator signal is the same in the data and under 
the model, i.e. (xi(gi)) data = (xi(gi)) P (see Methods); 
as before, we also required that the model reproduced 
the overall correlation between every pair of cells, Cij. 
This gives then a stimulus-dependent maximum entropy 
(SDME) model, which takes the following form: 

P SDME ({x i }\s(t)) = (3) 
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The parameters of this model are: N x (N — l)/2 cou- 
plings fiij, K x N parameters ctj, and a linear filter k$ 
for each cell. We used a Monte Carlo based gradient de- 
scent learning procedure to find the model parameters 
a,f3 numerically (see Methods). 

By construction, the SDME model exactly reproduces 
the covariance in activities, Cij, between all pairs of cells, 
and also the LN model properties of every cell: an arbi- 
trary nonlinear function M can be encoded by properly 
choosing how parameters depend on the linear projec- 
tions of the stimulus, g^. We can construct a maximum 
entropy model with = (no constraints on the pair- 
wise correlations Cij). The result is a set of uncoupled 
(conditionally independent) LN models: 
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FIG. 2: SDME model predicts the firing rate of single 
cells better than LN models, (a) Example of the PSTH 
segment for one cell (blue) , the best prediction of an LN model 
(green) and of a SDME model (red), (b) Correlation between 
the true PSTH and SDME model prediction (vertical axis) vs. 
the correlation between the true PSTH and the LN model 
prediction (horizontal axis); each plot symbol is a separate 
cell, dotted line shows equality. The neuron chosen in panel 
(a) is shown in orange. 
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In sum, the time-dependent maximum entropy (TDME) 
model of Eq. ([2| is an extension of conditionally indepen- 
dent model that additionally reproduces the measured 
pairwise correlations between cells. In a directly anal- 
ogous way, the stimulus-dependent maximum entropy 
(SDME) model of Eq. ^ is an extension to the set of 
uncoupled LN models, Eq. Q, that additionally repro- 
duces the measured pairwise correlations between cells. 
Because P SDME (Eq.[3) agrees with P LN (Eq.[Z} exactly 
in all constrained single-neuron statistics, any improve- 
ment in prediction of the SDME, be it in the firing rate 
or the codeword distributions, can be directly ascribed 
to the effect of the interaction terms, fiij. 

We found that the SDME predicted the firing rate of 
single cells better than the LN models, with the nor- 
malized correlation coefficient between the true and pre- 
dicted firing rate, Corr (n(t), rf DME (t)) being 0.74±0.06 
(mean ± std across 100 cells), as shown in Fig. [2J3. 
The differences between the SDME and the LN models 
become more striking on the level of the activity pat- 
terns of the whole population. Figures [3^i,b show the 
log-likelihood ratio for the population activity patterns 
x = {xi} under the two models, showing that the SDME 
can be orders of magnitude better in predicting the pop- 
ulation neural response. These differences are large in 
particular for those stimuli that elicit a strong response 
(Fig. [3^), that is, precisely where the response consists of 
synchronous spiking and the structure of the codewords 
can be nontrivial. Moreover, the difference between the 
models becomes increasingly significant with the size of 
the population TV, and particularly apparent for groups 
of more than 20 cells (Fig. 
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FIG. 3: SDME model predicts population activity patterns for N — 100 neurons better than conditionally 
independent LN models, (a) The log-likelihood ratio of the population firing patterns under the SDME model and under a 
collection of LN models, shown as a function of time (red) for an example stimulus repeat. For reference, the average population 
firing rate is shown in grey, (b) A scatterplot of the log- likelihoods under the SDME and LN models; each dot represents an 
average over activity patterns {xi} (across repeats) at a given time bin; dotted line shows equality, (c) The log-likelihood ratio 
under the SDME and LN models as a function of the average firing rate in the population; SDME outperforms LN models in 
particular for patterns with more spiking activity, (d) The average likelihood ratio between the SDME and LN models as a 
function of the population size, N (error bars = std over 10 randomly chosen groups of neurons at that N). 



We next examined how well various models for the neu- 
ral codebook, P({a^}|s), explain the total vocabulary, 
that is, the distribution of neural codewords observed 
across the whole duration of the experiment, P({xi}) = 
(P({xi}\s(t))) t . Despite the nominally large space of pos- 
sible codewords — much larger than the total number of 
samples in the experiment (2^ ^> T) — the sparsity of 
spikes and the correlations between neurons restrict the 
vocabulary to a much smaller set of patterns. Some of 
these occur many times during our stimulus presenta- 
tion, allowing us to estimate their empirical probability, 
P data ({xi}), directly from the experiment, and compare 
it to the model prediction [35]. The most prominent ex- 
ample of such frequently observed codewords is the silent 
pattern, X{ = 0, which is seen ~ 72% of the time. 

Figure [4] shows the likelihood ratio of the model prob- 
ability and empirical probability for various codewords 
observed in the experiment, as a function of the rate at 
which these codewords appear in the experiment. While 
SDME model in Fig. [4^i does not predict the frequencies 
of all patterns perfectly, it strongly outperforms the un- 
coupled set of LN models in Fig. [IJd, and has a slightly 
better performance than the full conditionally indepen- 
dent model (Fig. [4^), despite the fact that the latter is 
determined by N x 1000 = 1 • 10 5 parameters, the fir- 
ing rates of every cell in every time bin. On average, 
SDME predicts the probabilities of the patterns of ac- 
tivity with no bias, and with a standard deviation of 
\ og (pSDME jpdata^ of about 1; uncoup i e d LN models in 

comparison are biased and have a spread that is more 
than twice as large. Even more striking is the fact that 
LN models assign such low probabilities to some code- 
words that they are never generated during our Monte 
Carlo sampling (and are therefore not even shown in scat- 
terplots of Fig. [4|, while they are frequently observed in 
the experiment. This discrepancy is quantified by enu- 



merating the M most probable patterns in the data and 
in the model (by sampling; see Methods), and measuring 
the size of the intersection of the two sets of patterns; 
in other words, we ask if the model is even able to ac- 
cess all the patterns that one is likely to record in the 
experiment. As shown in the third row of Fig. |4j SDME 
does well on this task, with 434 codewords in the inter- 
section of the 500 most likely patterns in the data and 
the model; this is a much better performance than for 
the uncoupled model, and slightly better than for the 
conditionally independent model. 

The SDME model was constructed to capture ex- 
actly the total correlations in neuronal spiking, Qj = 
(xiXj) — (xi)(xj). With repeated stimulus, this total cor- 
relation can be broken down into the signal and noise 
components. The signal correlations, C-p are inferred 
by applying the same formula as for the total correla- 
tion, but on the spiking raster where the repeated trial 
indices have been randomly and independently permuted 
for each time bin. This removes any correlation due to 
interactions between spikes on simultaneously recorded 
trials, and only leaves the correlations induced by the 
response being locked to the stimulus. The noise cor- 
relation, Cfj, is then defined as the difference between 
the total and the signal components, CJJ = Cij — Cfj. 
We calculated the noise correlations between all pairs 
in our N = 100 neuron dataset. By their definition, 
the conditionally independent models cannot reproduce 
C™-, which are always zero. To assess the performance 
of the SDME, we drew samples from our model distri- 
bution using the Monte Carlo simulation and compared 
the noise correlations in the simulated rasters to the true 
noise correlations. The model prediction tightly corre- 
lates with the measured values, as shown in Fig. |5j We 
observe a systematic deviation of ~ 25%, most likely be- 
cause the assumed dependence on the stimulus through 
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FIG. 4: The performance of various models in account- 
ing for the total vocabulary of the population, P({xi}). 
The results for the SDME model are shown in (a), the results 
for an uncoupled set of LN models in (b), the results for a 
full conditionally independent model in (c). The first row 
displays the log ratio of model to empirical probabilities for 
various codewords (dots), as a function of that codeword's 
empirical frequency in the recorded data. The model prob- 
abilities were estimated by generating Monte Carlo samples 
from the corresponding model distributions (see Methods); 
only patterns that were generated in the MC run as well as 
found in the recorded data are shown. The second row sum- 
marizes this scatterplot by binning codewords according to 
their frequency, and showing the average log probability ra- 
tio in the bin (solid line), as well as the 1 std scatter across 
the codewords in the bin (shaded area). The highly probable 
all-silent state, {xi} = 0, is shown separately as a circle. The 
third row shows the overlap between 500 most frequent pat- 
terns in the data and 500 most likely patterns generated by 
the model (see text). 



one linear filter per neuron is insufficient to capture the 
complete dependence on stimulus, thereby underestimat- 
ing the full structure of stimulus correlation and inducing 
an excess in the noise correlation. Despite this, the de- 
gree of correspondence in noise correlations observed in 
Fig.|5]is telling us that SDME has clearly captured a large 
amount of noise covariance structure in neural firing. 

How should we interpret the inferred parameters of the 
SDME model? LN models have a clear "mechanistic" in- 
terpretation in terms of the cell's receptive field and the 
nonlinear spiking mechanism. Here, similarly, the stim- 
ulus dependent part of the model for each cell, c^, is a 
nonlinear function of a filtered version of the stimulus 
gi(t) = ki • s(t); in the absence of neuron-to- neuron cou- 
plings, the nonlinear ity of every neuron would correspond 
to Mi{gi) ~ f{pLi{9i)), where /(■) = exp(.)/(l + exp(-)), 
according to Eq. Q. The dependence of on the stim- 
ulus projection gi is similar across the recorded cells as 
shown in Fig. [6^i; as expected, higher overlaps with the 
linear filter induce higher probability of spiking. 

The pairwise interaction terms in the model, fy, are 
symmetric, static, and stimulus independent by construc- 
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FIG. 5: Measured vs predicted noise correlations for 
the SDME model. Noise correlation (see text) is estimated 
from recorded data for every pair of neurons, and plotted 
against the noise correlation predicted by the SDME model 
(each pair of neurons = one dot; shown are N(N — l)/2 dots 
for N — 100 neurons). Conditionally independent models 
predict zero noise correlation for all pairs. 



tion. As such, they represent only functional and not 
physical (i.e. synaptic) connections between the cells. 
Fig. ^6jp shows the pairwise interaction map for 100 cells; 
the histogram of their values (in Fig. [6^) reflects that they 
can be of both signs, but the distribution has a stronger 
positive tail, i.e. a number of cell pairs tend to spike 
together or be silent together with a probability than is 
higher than expected from their respective LN models. 
We can compare these interactions to the interactions 
of a static (non-stimulus-dependent) maximum entropy 
model for the population vocabulary [26l [28] : 



-yME 



d x i}) = ex P 
^0 



In this model for the total distribution of codewords, 
there is no stimulus dependence, and the parameters a® 
and /3®j are chosen to that the distribution is as random as 
possible, while reproducing exactly the measured mean 
firing rate of every neuron (xi}data — (xi) pme, and every 
pairwise correlation, (xiXj)data = (xiXj) P ME, across the 
whole duration of the experiment. 

Interestingly, we find that the pairwise interaction 
terms in the SDME model of Eq. ([3| are closely related 
to the interactions in the static pairwise maximum en- 
tropy model of Eq. (J5|: SDME interactions, fy, tend to 
be smaller in magnitude, but have an equal sign and rel- 
ative ordering, as the static ME interactions, Some 
degree of correspondence is expected: an interaction be- 
tween neurons i and j in the static ME model captures 
the combined effect of the stimulus and noise correlations, 
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FIG. 6: SDME model parameters, (a) Average values of 
the LN-like driving term, oti(gi), where gi = • s, across all 
cells i (error bars = std across cells), for each of the K = 20 
adaptive bins for gi (see Methods), (b) Pairwise interaction 
map fiij of the SDME model, between all N — 100 neurons in 
the experiment, (c) Histogram of pairwise interaction values 
from (b) , and their average value as a function of the distance 
between cells (inset), (d) For each pair of cells i and j, we 
plot the value of under the static maximum entropy model 
of Eq. |5| vs. the fiij from the stimulus-dependent maximum 
entropy model of Eq. ([3]). 



while in the corresponding SDME interaction, (most of) 
the stimulus correlation has been factored out into the 
correlated dynamics of the inputs to the neurons i and 
j, i.e. di(gi(t)) and aj(gj(t)). The surprisingly high de- 
gree of correspondence, however, indicates that even the 
interactions learned from static maximum entropy mod- 
els can account for, up to a scaling factor, the pairwise 
neuron dependencies that are not due to the correlated 
stimulus inputs. 

The SDME model is an approximation to the neu- 
ral codebook, P({^}|s), while the static ME model de- 
scribes the population vocabulary, P({xi}). With these 
two distributions in hand, we can explore how the pop- 
ulation jointly encodes the information about the stimu- 
lus into neural codewords — the joint activity patterns of 
spiking and silence. We make use of the fact that we can 
estimate the entropy of the maximum entropy distribu- 
tions using a procedure of heat capacity integration, as 
explained in Refs. [271 122] (see Methods). Information 
(in bits) per codeword is 



J ds P(s)^P({*J|s)log; 



{Xi} 

S[P({ Xi })]-(S[P({ Xl }\s)]) 



P({Xi}) 



that is, the information can be written as a difference 
of the entropy of the neural vocabulary, and the noise 
entropy (the average of the entropy of the codebook), 
where the entropy is S[p(x)] = — J dx p(x) \og 2 p(x). Be- 
cause of the maximum entropy property of our model 
for P ME ({xi}), the entropy of our static pairwise model 
in Eq. ([5| is an upper bound on the transmitted infor- 
mation; expressed as an entropy rate, this amounts to 
s = S[P ME ({x i })]/At w 730 bit/s. 

The brain does not have direct access to the stimulus, 
but only receives codewords {a^}, "drawn" from P({a^}), 
by the retina. At every moment in time, — log 2 P({#i}) 
measures the surprise about the output of the retina, 
and thus about the stimulus. We, as experimenters — but 
not the brain — have access to stimulus repeats and thus 
to P({xi}\s(t)), so we can compute the average value of 
surprise (per unit time) at every instant t in the stimulus: 



(7) 



{^i} 



P(s 



(6) 



This quantity can be expressed using the entropies and 
the learned parameters of our maximum entropy models, 
and is plotted as a function of time in Fig. [7[ Since aver- 
aging across time is equal to averaging over the stimulus 
ensemble, we see from Eq. ([7]) that (S(t)) t would have 
to be identically equal to *5[P({x})] under the condition 
that (P({xi}\s(t)))t = P({xi}) (marginalization). Since 
we build models for P({xi}) (static ME) and P({xi}\s) 
(SDME) from data independently, they need not obey 
the marginalization condition exactly, but they will do 
so if they provide a good account of the data. Indeed, by 
using the static ME and SDME distributions in Eq. Q 
for surprise, we find that (S(t)) t ~ 740 bit/s, very close 
to the entropy rate s of the total vocabulary and within 
our estimated 1% error bars on entropy computation. 

To estimate the information transmission, we have to 
subtract the noise entropy rate from the output entropy 
rate s, as dictated by Eq. ([6|. The entropy of the SDME 
model is an upper bound on the noise entropy; since this 
is not a lower bound, we cannot put a strict bound on 
the information transmission, but can nevertheless esti- 
mate it. Figure [7] shows the "instantaneous information" , 
l(t) = S(t) - S[P SDME ({x i }\s(t))]/At, as a function of 
time; from Eq. (|6|, the mutual information rate is a time 
average of this quantity, R = I({xi};s)/At = (l(t)) t . 
We find R w 130 bit/s. This quantity can be compared 
to the total entropy rate of the stimulus itself (which must 
be higher than i?), which in our case is w 210 bit/s (see 
Methods). While our estimates seem to indicate that a 
lot of vocabulary bandwidth (730 bit/s) is "lost" to noise 
(600 bit/s), the last comparison shows that the Gaussian 
FFF stimulus source itself is not very rich, so that the 
estimated information transmission takes up more than 
half of the actual entropy rate of the source. 

Lastly, we asked how important is the inclusion of 
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FIG. 7: Surprise and information transmission esti- 
mated from the SDME model, (a) Surprise rate (blue) is 
estimated from the static ME and SDME models assuming in- 
dependence of codewords across time bins. The instantaneous 
information rate (red) is the difference between the surprise 
and the noise entropy rate, estimated from the SDME model 
(see text). The information transmission rate is the average 
of the instantaneous information across time, (b) Population 
firing rate as a function of time shows that bursts of spiking 
strongly correlate with the bursts of surprise and information 
transmission in the population, (c) The stimulus (normalized 
to zero mean and unit variance) is shown for reference as a 
function of time. 



pairwise interactions, faj, into the SDME model, com- 
pared to a set of uncoupled LN models, when account- 
ing for information transmission. We therefore esti- 
mated the noise entropy rate for a set of uncoupled LN 
models, S[P LN ({xj|s(t))]/ At, which was found to be 
« 770 bit/s, considerably higher than the noise entropy 
of the SDME model. Crucially, this noise entropy rate 
is larger than the total entropy rate s estimated above, 
which is impossible for consistent models of the neural 
codebook and the vocabulary (since it would lead to 
negative information rates). This failure is a quantita- 
tive demonstration of the inability of the uncoupled LN 
models to reproduce the statistics of the population vo- 
cabulary, as shown in Fig. [4)3, despite a seemingly small 
performance difference on the level of single cell PSTH 
prediction. 



DISCUSSION 

We presented a modeling framework for stimulus en- 
coding by large populations of neurons, which combines 
an individual neuronal receptive field model, with the 
ability to include pairwise interactions between neurons. 
The result is a stimulus-dependent maximum entropy 
(SDME) model, which is the most parsimonious model of 
the population response to the stimulus that reproduces 
the linear-nonlinear (LN) aspect of single cells, as well as 
the correlation structure between neurons. In two limit- 
ing cases, the SDME model reduces to known models: if 



the single cell parameters a are static, SDME becomes 
the static maximum entropy model of the population vo- 
cabulary; if the couplings /3 are 0, SDME becomes a set 
of uncoupled LN models. The framework is general, and 
could be easily applied to other neural systems. 

We applied this modeling framework to the salaman- 
der retina presented with Gaussian white noise stimuli, 
and found that the interactions between neurons play 
an important role in determining the detailed patterns 
of population response. In particular, the SDME model 
gave better prediction of PSTH of single cells, yielded 
orders of magnitude improvement in describing the pop- 
ulation patterns, and captured significant aspects of noise 
correlations. The deviations between the SDME and the 
uncoupled LN model became significant for > 20 cells, 
and tended to occur at "interesting" times in the stimu- 
lus, precisely when the neural population was not silent. 

The responses of the neural system in the maximum 
entropy framework are binary codewords of spiking and 
silence across the neural population. The choice of 
timescale over which these codewords are defined, here 
At = 10 ms, is short enough such that multiple spikes are 
rarely observed in the same time bin, but long enough so 
that most of the strong spike- spike interactions (as well 
as fine temporal detail, such as spike-timing jitter) occur 
within a single bin. This allows us to view successive time 
bins as codewords, although some statistical dependence 
between them remains, possibly in the conditional SDME 
model (due to multiple timescales on which the neurons 
respond to stimuli), and certainly in the static ME model 
[3T] , If we were to make the time scale much shorter, 
e.g. by an order of magnitude, we could make the con- 
ditional independence assumption of the responses given 
the stimuli and previous spiking, which would lead us 
to GLM models [40] or nonequilibrium generalizations of 
Ising models [47]. GLMs, in particular, are excellent gen- 
erating models for precise spiking rasters, are easy to in- 
fer, and allow for asymmetric couplings between neurons. 
However, the inference in all these cases is tractable be- 
cause there are no interactions between the spikes within 
the same time bin (as there are in SDME). This neces- 
sitates the use of very short time bins and introduces 
strong dependencies between successive time bins, mak- 
ing the interpretation of the discretized neural responses 
in terms of individual codewords difficult. For this rea- 
son, GLM and SDME are complementary approaches: 
the first allows for a temporally-detailed probabilistic de- 
scription of a spiking process, while the second gives an 
explicit expression for the probability distribution over 
codewords in longer temporal bins. 

SDME allowed us to improve over LN models for sala- 
mander retinal ganglion cells both in terms of the PSTH 
prediction and, especially, in terms of population activity 
patterns. Interestingly, for parasol cells in the macaque 
retina under flickering checkerboard stimulation, the gen- 
eralized linear model did not yield firing rate improve- 
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ment relative to uncoupled LN models (but did improve 
higher order statistics of firing) [40]. In both cases, how- 
ever, the improvements reflect the role of dependencies 
among cells in encoding the stimulus, and their effect be- 
comes apparent when we ask questions about information 
transmission by a neural population. Maximum entropy 
models can only put an upper bounds on the total en- 
tropy and the noise entropy of the neural code (and this 
statement remains true even if successive codewords are 
not independent), and as such cannot set a strict bound, 
but only give an estimate, for the information transmis- 
sion. Nevertheless, ignoring the inter- neuron dependen- 
cies and using an uncoupled set of LN models predicts 
the total population responses so badly that the esti- 
mated noise entropy is higher than the upper bound on 
the total entropy, which is a clear impossibility, while the 
SDME model gives transmission rates that appear rea- 
sonable (and positive), amounting to about 60% of the 
source entropy rate. 

Tkacik and colleagues [61] have suggested that one can 
interpret fiij in an SDME model as a prior over the activ- 
ity patterns that the population would use to optimally 
encode the stimulus. For low noise level they argued 
that the prior should be minimal (and could help decor- 
relate the responses), as the population could faithfully 
encode the stimulus, whereas in the noisy regime, the 
prior should match the statistics of the sensory world and 
thus counteract the effects of noise. Similarly, Berkes and 
colleagues [62] suggested a similar reason for the similar- 
ity of ongoing and induced activity patterns in the visual 
cortex. Our results show that interactions are necessary 
for capturing the network encoding, and implicitly re- 
flect the existence of such a prior. The recovered inter- 
actions are strongly correlated with the interaction pa- 
rameters of a static, stimulus independent model over 
the distribution of patterns, making it possible for the 
brain (which only has access to the spikes, not the stim- 
ulus) to learn these values. Whether the interactions 
are matched to the statistics of the visual inputs as sug- 
gested by Ref [61 is the focus of future work. In parallel, 
increasingly detailed statistical models of neural codes 
going beyond SDME (e.g. by including temporal depen- 
dencies as in Ref [48 J, and efforts to infer such mod- 
els from experimental data, should focus our attention 
on population-level statistics and on finding principled 
information-theoretic measures for quantifying the code, 
like the surprise and instantaneous information suggested 
here. 



METHODS 

Electrophysiology. Experiments were performed on 
the adult tiger salamander, Ambystoma tigrinum. All ex- 
periments were in accordance with Ben-Gurion Univer- 
sity of the Negev and government regulations. Extracted 



retinas were placed with the ganglion cell layer facing a 
multielectrode array with 252 electrodes (Ayanda Biosys- 
tems, Switzerland), and superfused with oxygenated 
Ringer medium at room temperature. Extracellular ly 
recorded signals were amplified (MultiChannel Systems, 
Germany) and digitized at 10k Samples/s, and spike- 
sorted using custom software written in MATLAB. 

Visual stimulation. Stimuli were projected onto the 
retina from a CRT video monitor (ViewSonic G90fB) at 
a frame rate of 60 Hz; each movie frame was presented 
twice, using standard optics. Full Field Flicker (FFF) 
stimuli were generated by independently sampling spa- 
tially uniform gray levels (with a resolution of 8 bits) 
from a Gaussian distribution, with mean luminance of 
147 lux and the standard deviation of 33 lux. These 
data allow us to estimate the entropy rate of the source 
(as used in the main text), by multiplying the entropy 
of the luminance distribution with the refresh rate. To 
estimate the cells' receptive fields, checkerboard stimu- 
lus was generated by selecting each checker (~ 100 jam 
on the retina) randomly every 33 ms to be either black 
or white. To identify the RF centers, a two-dimensional 
Gaussian was fitted to the spatial profile of the response. 
The movies were gamma corrected for the computer mon- 
itor. In all cases the visual stimulus entirely covered the 
retinal patch that was used for the experiment. 

Inferring SDME from data. The LN model for 
each neuron i consists of the linear filter Iq, and the non- 
linear function Mi , which is defined pointwise on a set of 
binned values for the generator signal, gi = Iq • s. We 
used binning into K = 20 bins such that initially each 
bin contains roughly the same number of values for g^ 
but subsequently the binning is adaptively adjusted (sep- 
arately for each neuron) to be denser at higher values of 
gi, where the firing rates are higher. We fitted LN models 
with varying number of K bins, and have chosen K = 20 
when the performance of the LN models appeared to sat- 
urate [63] . 

To find the parameters of the stimulus-dependent max- 
imum entropy model (c^(gi), /%), we retained the bin- 
ning of the generator signal used for LN model construc- 
tion. Given trial values for the SDME parameters, we es- 
timated the chosen expectation values (covariance matrix 
Cij in firing, and the firing rate conditional on r^(^)) 
by Monte Carlo sampling from the trial distribution in 
Eq. the learning step of the algorithm is computed by 
comparing the expectation values in the trial distribution 
and the empirical distribution (computed over the train- 
ing half of the stimulus repeats). In detail, we used a gra- 
dient ascent algorithm, applying a combination of Gibbs 
sampling and importance sampling in order to efficiently 
estimate the gradient, by using optimizations similar to 
those described in Ref. [63] . Sampling was carried out 
in parallel on a 16 node cluster with two 2.66GHz Intel 
Quad-Core Xeon processors and 16GB of memory per 
node. The calculation was terminated when the average 
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error in firing rates and coincident firing rates reached 
below 1% and 5% respectively, which is within the ex- 
perimental error. 

To compute the single neuron PSTH and compare the 
distributions of codewords from the model to the empir- 
ical distribution, we used Metropolis Monte Carlo sam- 
pling to draw codewords from the model distributions; 
we drew 5000 independent samples (to draw uncorre- 
cted configurations, a sample was recorded only after 
100 "spin-flip" trials) for every timepoint, for a total of 
5 • 10 6 samples; the same procedure was used also to 
draw from the uncoupled (f3 = 0) models. To estimate 
the entropies of high dimensional SDME distributions, 
we used the "heat capacity integration" method, detailed 
in Ref [32]. Briefly, a maximum entropy model P(x) = 
Z~ x exp(— E(x.)) (where E is the Hamiltonian function 
determined by the choice of constrained operators and 
the conjugated parameters) is extended by introducing a 
new parameter T, much like the temperature in physics, 
so that P T (x) = Z~ x exp(-£(x)/T). The entropy of the 
distribution is given by S[P T =i] = Jq C(T)/TdT, where 
the heat capacity C(T) = a^(T)/T 2 , and the variance 
in energy can be estimated at each T by Monte Carlo 
sampling. In practice, we run a separate Monte Carlo 
sampling for a finely discretized interval of temperatures, 
T G [0, 1], estimate C(T) for each temperature, and nu- 
merically integrate to get the entropy S. We have pre- 
viously shown that this procedure yields robust entropy 
estimates even for large numbers of neurons [27[ [32] . 
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